Exploratory Data Analysis on Red Wine Composition by Fernando Motta

# Loading the utilized libraries

library("ggplot2")
library("gridExtra")
library("corrplot")
library("magrittr")
library("pander")
library("memisc")
library("dplyr")
# Loading the dataset
data <- read.csv('wineQualityReds.csv')

Univariate Plot Section

First, let us check some basic information and statistics about the data.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

This dataset has 13 variables (one of which, X, is just a row index) and 1599 observations. This suggests an ample survey within the realm of Portuguese red wine.

Also, to become familiar with the dataset, I will plot the distribution of each variable before analysing any relationships. To begin with, I will plot the distribution of wine quality.

It stands out from this plot that the ratings are limited to the range between 3 and 8, which leads us to question whether the dataset is complete. Since these are experienced tasters trying only a specific type of wine (“Vinho Verde”), it seems likely that they are rating it relative to wine in general rather than only within its own type. There is also a strong concentration in the ratings 5 and 6. It will be interesting to observe whether the other characteristics appear in similarly tight and sharp ranges.
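A quick tabulation makes the gaps at both ends of the scale explicit. The sketch below uses synthetic ratings rather than the real `data$quality` column, and it assumes the 0 to 10 rating scale given in the dataset's documentation; the sampling probabilities are made up for illustration only.

```r
# Synthetic stand-in ratings (NOT the real wine data): integer scores 3..8
set.seed(1)
quality <- sample(3:8, size = 500, replace = TRUE,
                  prob = c(0.01, 0.03, 0.43, 0.40, 0.12, 0.01))

# Count every score on the assumed 0-10 scale, so unused ratings appear as 0
counts <- table(factor(quality, levels = 0:10))
print(counts)

# Share of wines rated 5 or 6
share_5_6 <- mean(quality %in% c(5, 6))
print(round(share_5_6, 2))
```

With the real data, replacing `quality` by `data$quality` should give the counts behind the bar chart discussed above.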

The fixed acidity distribution is also very concentrated, around 8, which is beginning to raise the suspicion that most “Vinho Verde” wines are somewhat similar.

The volatile acidity presents a different characteristic from the ones observed so far, showing a bimodal profile. This distribution presents a few outliers, but they don’t affect the visualization so much, so I’ll keep them.

Citric acid shows a seemingly near random distribution, which can be attributed to the fact that it adds a characteristic to the flavour which is neither good nor bad (freshness), but dependent on the winemaker’s goals.

Residual sugar shows another very skewed distribution, with a few outliers.

The chlorides distribution is very similar to that of residual sugar.

With free sulphur dioxide we once again have a skewed distribution with a long tail and a high peak. We also perceive a few outliers.

It was to be expected that the total sulphur dioxide distribution would be very similar to the one found for free sulphur dioxide, since both quantities are related.

It is noteworthy that density is the first variable to exhibit a distribution that is very similar to the Normal Distribution. The same behaviour is observed for the pH distribution below.

Sulphates present a behaviour we have come to expect, with a long tailed distribution, but it behaves better than the aforementioned distributions in the sense that it has fewer outliers.

Alcohol is once again similar, but it is not as skewed as the other distributions.

Analysis of the Univariate Plots

There seems to be a large concentration of wines along the central values of the quality rating, which raises the question of whether this dataset is complete.

It seems natural that pH might have a large influence on the taste, as well as all acidity measures, since acidic and basic substances have a characteristic taste. However, some of the acids present might have another flavouring characteristic which is more noticeable than the drop in pH they cause. So it’s important to observe the relationship between each acid and the pH and also between them and the wine rating.

The exception to this might be the citric acid, since its distribution has very particular characteristics.

Apart from citric acid, density and pH, the trend among the other variables seems to be a long tailed skewed distribution with some outliers.
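The "long tailed skewed distribution with some outliers" pattern can be quantified rather than judged by eye. The following is a minimal sketch on synthetic data, not the wine measurements: `skewness()` is a small helper defined here (not a base R function), and the log-normal sample is only a stand-in for a variable such as residual sugar.

```r
# Sample skewness: third standardized moment (about 0 for a symmetric sample)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Synthetic stand-in for a long-tailed variable (NOT the real data)
set.seed(2)
long_tailed <- rlnorm(1000, meanlog = 0.8, sdlog = 0.5)

# Skewness before and after a log10 transform
skew_raw <- skewness(long_tailed)
skew_log <- skewness(log10(long_tailed))
print(round(c(raw = skew_raw, log10 = skew_log), 2))

# Points beyond 1.5 * IQR from the quartiles, the usual box plot outlier rule
n_outliers <- length(boxplot.stats(long_tailed)$out)
print(n_outliers)
```

A clearly positive skewness that shrinks towards zero after `log10()` is one common reason for analysing such variables on a logarithmic scale.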

Bivariate Plots

To find out which relations are interesting, the first step would be to create a correlation table which would ease visualization.

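# Build a correlation matrix from the data (excluding the index column X)
# and display it with corrplot
cor_matrix <- cor(
  data %>%
    dplyr::select(-X))
# pander alternative: emphasize.strong.cells(which(abs(cor_matrix) > .3 & cor_matrix != 1, arr.ind = TRUE))
corrplot(cor_matrix, method="color")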

As water has a greater density than alcohol, it is in accordance with my expectations that alcohol has a negative correlation with density. The variables most strongly correlated with quality are volatile acidity and alcohol. The relationship between fixed acidity and density is also very strong.

To enhance this analysis, box plots are necessary, assessing each characteristic’s association with quality, which is what we want to evaluate.
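As a rough illustration of what each box plot summarises, the sketch below groups one variable by quality and extracts the five-number summaries that the boxes are drawn from. The data frame `wine` is synthetic (it is not the dataset loaded above), and the linear link between alcohol and quality is an assumption made only so that the example has a visible trend.

```r
# Synthetic stand-in data (NOT the real wine data): 50 wines per quality level
set.seed(8)
wine <- data.frame(quality = rep(3:8, each = 50))
wine$alcohol <- 8 + 0.5 * wine$quality + rnorm(nrow(wine), sd = 0.8)

# Five-number summary per quality level; drop plot = FALSE to draw the boxes
box <- boxplot(alcohol ~ quality, data = wine, plot = FALSE)
colnames(box$stats) <- box$names
rownames(box$stats) <- c("lower whisker", "Q1", "median", "Q3", "upper whisker")
print(round(box$stats, 2))

# Group means and medians side by side, to see how close they are
medians <- tapply(wine$alcohol, wine$quality, median)
means <- tapply(wine$alcohol, wine$quality, mean)
print(round(rbind(median = medians, mean = means), 2))
```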

Fixed acidity does not seem to influence quality.

Volatile acidity seems to be inversely related to the quality of the wine, which one might find surprising due to the correlation between quality and pH. On further analysis, however, one will notice that Volatile acidity is positively correlated with pH (therefore inversely correlated to acidity).

This might mean that volatile acids do not actually remain in solution in the wine, but are instead lost to the air through their volatility.

The correlation between citric acid and quality is positive, which is unexpected given that the citric acid distribution is nearly flat and looks random.

Residual sugar should not have a very large effect due to the nature of the winemaking process, where it exists as a byproduct of the fermentation. Therefore, the mean should be about the same along the range of qualities, which is what we observe indeed.

There seems to be a small correlation between the reduction of chloride and quality. However, I think this correlation is too small to be representative.

Free sulphur dioxide seems to have some influence: too little of it is associated with poor wine.

Total sulphur dioxide is expectedly similar, since it includes the free sulphur dioxide.

Lower density seems to result in better wine. This, however, might be related to the lower density of alcohol.

The pH seems to correlate inversely with the quality, which implies more acidic wines are better.

The correlation between sulphates and quality seems to be positive, which is normal because the ions are strong; it is therefore consistent with the pH observation.

Alcohol seems to be the main driver of wine quality. However, the number of outliers suggests there may be another associated variable.

To evaluate this hypothesis, it is appropriate to fit a linear model of quality against alcohol.

## 
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8442 -0.4112 -0.1690  0.5166  2.5888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.87497    0.17471   10.73   <2e-16 ***
## alcohol      0.36084    0.01668   21.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared:  0.2267, Adjusted R-squared:  0.2263 
## F-statistic: 468.3 on 1 and 1597 DF,  p-value: < 2.2e-16
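A summary of this form comes from fitting `lm()` and calling `summary()` on the result. The sketch below repeats the idea on synthetic data (the coefficients used to generate it are assumptions, not the fitted values above) and shows that, for a single predictor, R squared is simply the squared Pearson correlation. This is consistent with the report itself: the R squared of 0.2267 above is the square of the alcohol and quality correlation of about 0.476 listed further below.

```r
# Synthetic stand-in data (NOT the real wine measurements)
set.seed(3)
wine <- data.frame(alcohol = runif(400, min = 8.4, max = 14.9))
raw_score <- 1.9 + 0.36 * wine$alcohol + rnorm(400, sd = 0.7)
wine$quality <- pmin(pmax(round(raw_score), 3), 8)  # integer scores in 3..8

# Simple linear regression of quality on alcohol
fit <- lm(quality ~ alcohol, data = wine)
fit_summary <- summary(fit)
print(coef(fit_summary))

# With one predictor, R squared equals the squared Pearson correlation
r_squared <- fit_summary$r.squared
r_pearson <- cor(wine$alcohol, wine$quality)
print(c(r_squared = r_squared, r_pearson_squared = r_pearson^2))
```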

The low value observed for R squared implies that other variables also play a large role in the wine quality. To test this, I will compute the correlation between each variable and the quality individually.

##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
## log10.residual.sugar      log10.chlordies  free.sulfur.dioxide 
##           0.02353331          -0.17613996          -0.05065606 
## total.sulfur.dioxide              density                   pH 
##          -0.18510029          -0.17491923          -0.05773139 
##      log10.sulphates              alcohol 
##           0.30864193           0.47616632

Some variables were analyzed through their logarithms due to their magnitude range. This test suggests that alcohol, volatile acidity, sulphates and citric acid have the strongest correlations with quality.
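One way such a list of correlations might be computed is sketched below. The data are synthetic and the generating coefficients are arbitrary assumptions; only the pattern matters: build the (possibly log-transformed) predictors, correlate each with quality, and rank by absolute value. `cor.test()` additionally gives a p-value for a single pair.

```r
# Synthetic stand-in data (NOT the real wine measurements)
set.seed(4)
n <- 300
wine <- data.frame(
  alcohol   = runif(n, min = 8.4, max = 14.9),
  sulphates = rlnorm(n, meanlog = -0.45, sdlog = 0.25),
  chlorides = rlnorm(n, meanlog = -2.5, sdlog = 0.4)
)
wine$quality <- round(1.5 + 0.35 * wine$alcohol + 1.0 * wine$sulphates -
                        3 * wine$chlorides + rnorm(n, sd = 0.7))

# Long-tailed variables enter through their base-10 logarithm
predictors <- data.frame(
  alcohol         = wine$alcohol,
  log10.sulphates = log10(wine$sulphates),
  log10.chlorides = log10(wine$chlorides)
)

# Pearson correlation of each predictor with quality, ranked by strength
correlations <- sapply(predictors, function(x) cor(x, wine$quality))
print(round(correlations[order(abs(correlations), decreasing = TRUE)], 3))

# Significance test for one pair
alcohol_test <- cor.test(wine$alcohol, wine$quality)
print(alcohol_test$p.value)
```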

Analysis of Bivariate Plots

In addition to the effects with a positive correlation with quality noted above, it is noteworthy that volatile acidity, chloride presence and density seem to have an inverse correlation, while fixed acidity and residual sugar have little effect.

Multivariate Plots

While alcohol is the main driver of quality, it still explains a relatively small share of it, as the R squared value of about 0.23 shows. Therefore it is necessary to evaluate other factors. To do this we will keep alcohol constant and insert other variables. The variables chosen were the ones shown in the former sections to have the most apparent impact.
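"Keeping alcohol constant" can be approximated by stratifying: split the wines into narrow alcohol bins and look at a second variable within each bin. The sketch below is one possible reading of that idea, on synthetic data with an assumed positive sulphate effect; it is not necessarily how the plots in this section were produced.

```r
# Synthetic stand-in data (NOT the real wine measurements)
set.seed(5)
n <- 600
wine <- data.frame(
  alcohol   = runif(n, min = 8.4, max = 14.9),
  sulphates = runif(n, min = 0.33, max = 1.2)
)
wine$quality <- 1.5 + 0.35 * wine$alcohol + 1.2 * wine$sulphates +
  rnorm(n, sd = 0.6)

# Bin alcohol into quartiles so it is roughly constant within each bin
wine$alcohol_bin <- cut(
  wine$alcohol,
  breaks = quantile(wine$alcohol, probs = seq(0, 1, by = 0.25)),
  include.lowest = TRUE
)

# Correlation between sulphates and quality inside each alcohol bin
within_bin <- sapply(
  split(wine, wine$alcohol_bin),
  function(g) cor(g$sulphates, g$quality)
)
print(round(within_bin, 2))
```

If the second variable still correlates with quality inside every bin, its association is not merely a side effect of alcohol.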

Density seems to be irrelevant given constant alcohol, which is consistent with the idea that its apparent effect on quality comes from its relationship with alcohol.

It seems that, given constant alcohol, higher levels of sulphates produce better wine.

On the other hand, given constant alcohol, higher volatile acids produce poorer wines.

Expectedly, lower pH with constant alcohol produces better wine. We had previously seen that volatile acidity and pH influence the wine in opposite ways.

Given this data, it’s interesting to generate a model considering the variables which correlate more strongly with quality.

# Split the data into a training set (60%) and a test set (the remaining 40%),
# then fit a sequence of linear models on the training set and tabulate them
# with mtable, to see which variables relate more strongly to wine quality

set.seed(1056)
training_data <- sample_frac(data, .6)
test_data <- data[ !data$X %in% training_data$X, ]
m1 <- lm(as.numeric(quality) ~ alcohol, data = training_data)
m2 <- update(m1, ~ . + sulphates)
m3 <- update(m2, ~ . + volatile.acidity)
m4 <- update(m3, ~ . + citric.acid)
m5 <- update(m4, ~ . + fixed.acidity)
m6 <- update(m2, ~ . + pH)
mtable(m1,m2,m3,m4,m5,m6)
## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity, 
##     data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity + 
##     citric.acid, data = training_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity + 
##     citric.acid + fixed.acidity, data = training_data)
## m6: lm(formula = as.numeric(quality) ~ alcohol + sulphates + pH, 
##     data = training_data)
## 
## =====================================================================================================
##                          m1            m2           m3           m4           m5            m6       
## -----------------------------------------------------------------------------------------------------
##   (Intercept)           1.891***      1.402***     2.858***     2.919***     2.446***      3.407***  
##                        (0.227)       (0.234)      (0.259)      (0.265)      (0.295)       (0.533)    
##   alcohol               0.360***      0.350***     0.301***     0.301***     0.312***      0.367***  
##                        (0.022)       (0.021)      (0.021)      (0.021)      (0.021)       (0.021)    
##   sulphates                           0.898***     0.543***     0.578***     0.595***      0.769***  
##                                      (0.134)      (0.131)      (0.135)      (0.134)       (0.136)    
##   volatile.acidity                                -1.352***    -1.432***    -1.523***                
##                                                   (0.127)      (0.145)      (0.147)                  
##   citric.acid                                                  -0.147       -0.566**                 
##                                                                (0.133)      (0.176)                  
##   fixed.acidity                                                              0.061***                
##                                                                             (0.017)                  
##   pH                                                                                      -0.635***  
##                                                                                           (0.152)    
## -----------------------------------------------------------------------------------------------------
##   R-squared             0.225         0.260        0.339        0.340        0.348         0.273     
##   adj. R-squared        0.224         0.258        0.337        0.337        0.345         0.271     
##   sigma                 0.716         0.701        0.662        0.662        0.658         0.695     
##   F                   277.605       167.615      163.069      122.635      101.923       119.480     
##   p                     0.000         0.000        0.000        0.000        0.000         0.000     
##   Log-likelihood    -1039.948     -1017.945     -963.755     -963.142     -956.677     -1009.265     
##   Deviance            491.191       469.160      419.025      418.490      412.885       460.744     
##   AIC                2085.897      2043.890     1937.510     1938.284     1927.354      2028.530     
##   BIC                2100.495      2063.353     1961.839     1967.479     1961.415      2052.859     
##   N                   959           959          959          959          959           959         
## =====================================================================================================

Analysis of Multivariate Plots

The variables most strongly associated with wine quality in combination seem to be alcohol, sulphates and volatile acidity. Citric acid adds little once these are included: R squared only rises from 0.339 (m3) to 0.340 (m4) and its coefficient in m4 is not significant. Also, the linear models presented low values for R squared (at most 0.348), which suggests that these variables alone are not enough to draw definite conclusions and that this study would profit from more extensive datasets.
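Whether an added term improves a model can be checked more formally by comparing nested fits. The sketch below uses synthetic training data in which citric acid has no effect by construction (an assumption for illustration, not a finding), and compares two models with `anova()` and `AIC()`; the model names mirror m3 and m4 from the table above but are refitted here on the stand-in data.

```r
# Synthetic stand-in training data (NOT the real wine measurements)
set.seed(6)
n <- 500
train <- data.frame(
  alcohol          = runif(n, min = 8.4, max = 14.9),
  sulphates        = runif(n, min = 0.33, max = 1.2),
  volatile.acidity = runif(n, min = 0.12, max = 1.2),
  citric.acid      = runif(n, min = 0, max = 1)
)
# citric.acid is deliberately left out of the generating formula
train$quality <- 2.9 + 0.3 * train$alcohol + 0.55 * train$sulphates -
  1.4 * train$volatile.acidity + rnorm(n, sd = 0.65)

m3 <- lm(quality ~ alcohol + sulphates + volatile.acidity, data = train)
m4 <- update(m3, ~ . + citric.acid)

# F-test for the extra term, and AIC for both models (lower is better)
comparison <- anova(m3, m4)
print(comparison)
print(AIC(m3, m4))

# Change in adjusted R squared from adding citric.acid
gain <- summary(m4)$adj.r.squared - summary(m3)$adj.r.squared
print(round(gain, 4))
```

A large p-value in the `anova()` row for the bigger model, together with a negligible change in adjusted R squared, suggests the added variable is not pulling its weight.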

Final Plots

Given the conclusion that alcohol and sulphates affect quality most prominently, I think that the most important graphics display these characteristics.

This plot shows that higher alcohol percentages are associated with better wines. Since the mean and median nearly coincide in each box plot, we infer that, for each particular quality, the alcohol distribution is roughly symmetric, so the high median in the top range implies that most high quality wines have a high alcohol content. But, as we mentioned before, the low R squared means that other variables are also relevant.

This plot indicates that high alcohol contents and high sulphate contents produce better wines in combination, which gives a more consistent characteristic to search for in wines.

# Analysis of the predictive model accuracy on the held-out test data
df <- data.frame(
  test_data$quality,
  predict(m5, test_data) - as.numeric(test_data$quality)
)
names(df) <- c("quality", "error")
# A geom layer is required, otherwise ggplot draws empty axes
ggplot(data = df, aes(x = quality, y = error)) +
  geom_jitter(alpha = 0.3) +
  ggtitle("Linear model errors vs expected quality")

This plot serves the purpose of illustrating the errors of the linear model. One can easily notice that the results are worse for the extreme ranges (poor wine and good wine), which can be attributed to the much larger number of average-quality samples.
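The same observation can be put into numbers by averaging the prediction error per quality level on held-out data. The following is a minimal sketch on synthetic data with a single predictor (the real model m5 and the real test set are not used here); the 60/40 split mirrors the one above, and `id` is a hypothetical row identifier playing the role of X.

```r
# Synthetic stand-in data (NOT the real wine measurements)
set.seed(7)
n <- 800
wine <- data.frame(id = seq_len(n), alcohol = runif(n, min = 8.4, max = 14.9))
raw_score <- 1.9 + 0.36 * wine$alcohol + rnorm(n, sd = 0.7)
wine$quality <- pmin(pmax(round(raw_score), 3), 8)

# 60/40 split into training and test sets
train_ids <- sample(wine$id, size = round(0.6 * n))
train <- wine[wine$id %in% train_ids, ]
test <- wine[!wine$id %in% train_ids, ]

# Fit on the training set, predict on the test set
fit <- lm(quality ~ alcohol, data = train)
error <- predict(fit, newdata = test) - test$quality

# Overall error, and mean error per true quality level
rmse <- sqrt(mean(error^2))
mean_error_by_quality <- tapply(error, test$quality, mean)
print(round(rmse, 3))
print(round(mean_error_by_quality, 2))
```

Positive mean errors for the lowest ratings and negative ones for the highest indicate that the model pulls its predictions towards the middle of the scale.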

Reflections

This dataset seems to have been treated beforehand, because it is extremely well behaved. Having said that, my biggest difficulty in this analysis was related to the fact that the distribution of wine quality was too centered around the average values, lacking data in the extremes.

This particularity in the dataset hindered the creation of a more accurate model, in my opinion. Another problem I saw was the fact that no single variable correlated strongly with quality, but then again, this is why sophisticated data analysis techniques are developed and employed.

As future work, this analysis might be extended to a larger dataset, possibly including other regions and grapes, favouring a more complete dataset.